@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 13% (0.13x) speedup for get_column_tolerance in datacompy/base.py

⏱️ Runtime: 1.16 milliseconds → 1.03 milliseconds (best of 151 runs)

📝 Explanation and details

The optimization replaces a nested .get() call with explicit in checks and direct dictionary access, for an overall speedup of about 13% (1.16 ms to 1.03 ms).

Key Changes:

  • Original: tol_dict.get(column, tol_dict.get("default", 0.0)) - always makes two .get() calls, because the inner call is evaluated as an argument before the outer one runs
  • Optimized: Uses if column in tol_dict followed by direct tol_dict[column] access - skips the "default" lookup when the column is present and avoids method call overhead
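The two variants can be sketched side by side. The original expression is quoted from the description above; the optimized body is not shown in this comment, so the version below is an assumed reconstruction from the "in checks and direct dictionary access" wording, and both function names are hypothetical.

```python
def get_column_tolerance_original(column, tol_dict):
    # Quoted from the description: the inner .get() is evaluated first,
    # as an argument to the outer .get().
    return tol_dict.get(column, tol_dict.get("default", 0.0))


def get_column_tolerance_optimized(column, tol_dict):
    # Assumed shape of the optimized body (reconstructed, not copied from the PR).
    if column in tol_dict:
        return tol_dict[column]
    if "default" in tol_dict:
        return tol_dict["default"]
    return 0.0


if __name__ == "__main__":
    tolerances = {"colA": 0.1, "default": 0.5}
    for name in ("colA", "colX"):
        print(name, get_column_tolerance_optimized(name, tolerances))
```

Both functions return the stored value untouched, so non-float values such as None or strings pass through the same way in either variant.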

Why It's Faster:

  1. Skips the eager inner call: The original looks up "default" on every call, even when the column exists and the fallback is never used
  2. Reduces method call overhead: The in operator and direct dictionary access tol_dict[column] are cheaper than .get() method calls
  3. Short-circuit evaluation: When the column exists (common case), the function returns without touching "default"
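One way to check these points is to count dictionary operations on each path. The sketch below uses a hypothetical CountingDict helper and the same assumed reconstruction of the optimized body (it is not taken from the PR diff); it counts operations only and says nothing about their cost.

```python
class CountingDict(dict):
    """dict subclass that counts .get(), `in`, and [] operations (illustrative helper)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def get(self, key, default=None):
        self.lookups += 1
        return super().get(key, default)

    def __contains__(self, key):
        self.lookups += 1
        return super().__contains__(key)

    def __getitem__(self, key):
        self.lookups += 1
        return super().__getitem__(key)


def original(column, tol_dict):
    # The inner .get() runs on every call because arguments are evaluated eagerly.
    return tol_dict.get(column, tol_dict.get("default", 0.0))


def optimized(column, tol_dict):
    # Assumed shape of the optimized body.
    if column in tol_dict:
        return tol_dict[column]
    if "default" in tol_dict:
        return tol_dict["default"]
    return 0.0


def count_lookups(func, column, data):
    """Run func once against a fresh CountingDict and return the operation count."""
    counting = CountingDict(data)
    func(column, counting)
    return counting.lookups


if __name__ == "__main__":
    data = {"colA": 0.1, "default": 0.5}
    for column in ("colA", "colX"):
        print(column, count_lookups(original, column, data), count_lookups(optimized, column, data))
```

Under this reconstruction the operation count is the same for both variants when the column exists, and the optimized variant performs one more operation when it falls back to "default". That suggests the measured gain comes from cheaper operations rather than fewer of them.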

Performance Characteristics:

  • Best case (column exists): 14-23% faster - avoids the nested .get() entirely
  • Default case (column missing, default exists): 3-36% slower - makes two in checks before reaching the default value, which costs more than the original's two .get() calls
  • No match case (neither column nor default): 2-16% faster - replaces the two .get() calls with two cheaper in checks
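The three cases can be reproduced locally with a small timeit harness. This is a minimal sketch, assuming the same reconstructed optimized body as the description implies; the names time_cases, original, and optimized are hypothetical. Timings vary by machine and Python version, so no expected numbers are given.

```python
import timeit


def original(column, tol_dict):
    return tol_dict.get(column, tol_dict.get("default", 0.0))


def optimized(column, tol_dict):
    # Assumed shape of the optimized body (not copied from the PR diff).
    if column in tol_dict:
        return tol_dict[column]
    if "default" in tol_dict:
        return tol_dict["default"]
    return 0.0


def time_cases(number=200_000):
    """Time both variants on the three cases; returns {case: (original_s, optimized_s)}."""
    with_default = {"colA": 0.1, "colB": 0.2, "default": 0.5}
    no_default = {"colA": 0.1, "colB": 0.2}
    cases = {
        "column exists": ("colA", with_default),
        "default used": ("colX", with_default),
        "no match": ("colX", no_default),
    }
    results = {}
    for name, (column, data) in cases.items():
        t_orig = timeit.timeit(lambda: original(column, data), number=number)
        t_opt = timeit.timeit(lambda: optimized(column, data), number=number)
        results[name] = (t_orig, t_opt)
    return results


if __name__ == "__main__":
    for name, (t_orig, t_opt) in time_cases().items():
        print(f"{name}: original {t_orig:.4f}s, optimized {t_opt:.4f}s")
```

Individual calls here take a few hundred nanoseconds, so single-call measurements are noisy. Timing many iterations per case, as above, gives a steadier comparison.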

Impact on Workloads:
Based on the function references, this function is called in hot paths within datacompy.core._intersect_compare() and all_mismatch(), methods that process every column during dataframe comparison operations. The gain depends on the workload: comparisons where most columns have explicit tolerance values should benefit, while comparisons that rely mostly on the "default" entry take the slower path.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 26 Passed |
| 🌀 Generated Regression Tests | 4573 Passed |
| ⏪ Replay Tests | 510 Passed |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_base.py::test_get_column_tolerance_column_is_default | 446ns | 438ns | 1.83%✅ |
| test_base.py::test_get_column_tolerance_default | 444ns | 505ns | -12.1%⚠️ |
| test_base.py::test_get_column_tolerance_empty_dict | 484ns | 458ns | 5.68%✅ |
| test_base.py::test_get_column_tolerance_exact_match | 840ns | 751ns | 11.9%✅ |
| test_base.py::test_get_column_tolerance_no_default | 481ns | 441ns | 9.07%✅ |
🌀 Generated Regression Tests and Runtime
# imports
from datacompy.base import get_column_tolerance

# unit tests

# 1. Basic Test Cases


def test_column_explicitly_in_dict():
    # Test that an explicitly listed column returns its value
    tol_dict = {"colA": 0.1, "colB": 0.2, "default": 0.5}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 491ns -> 429ns (14.5% faster)
    codeflash_output = get_column_tolerance(
        "colB", tol_dict
    )  # 280ns -> 231ns (21.2% faster)


def test_column_not_in_dict_but_default_exists():
    # Test that a column not listed returns the default value
    tol_dict = {"colA": 0.1, "default": 0.5}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 488ns -> 486ns (0.412% faster)


def test_column_and_default_not_in_dict():
    # Test that if neither the column nor default is present, returns 0.0
    tol_dict = {"colA": 0.1, "colB": 0.2}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 471ns -> 410ns (14.9% faster)


def test_column_is_default_key():
    # Test that if column is "default", returns the value for "default"
    tol_dict = {"default": 0.7, "colA": 0.1}
    codeflash_output = get_column_tolerance(
        "default", tol_dict
    )  # 442ns -> 394ns (12.2% faster)


def test_column_and_default_same_value():
    # Test that if column not present, returns default even if default is 0.0
    tol_dict = {"default": 0.0}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 507ns -> 490ns (3.47% faster)


# 2. Edge Test Cases


def test_empty_dict():
    # Test with an empty dictionary
    tol_dict = {}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 517ns -> 445ns (16.2% faster)


def test_column_is_empty_string():
    # Test with column as empty string
    tol_dict = {"": 0.33, "default": 0.44}
    codeflash_output = get_column_tolerance(
        "", tol_dict
    )  # 479ns -> 405ns (18.3% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 296ns -> 383ns (22.7% slower)


def test_default_is_none():
    # Test if default is None (should return None, but function expects float, so test behavior)
    tol_dict = {"default": None}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 453ns -> 467ns (3.00% slower)


def test_column_is_none():
    # Test if column is None (should not match any key, returns default or 0.0)
    tol_dict = {"colA": 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        None, tol_dict
    )  # 520ns -> 525ns (0.952% slower)


def test_tol_dict_has_non_float_values():
    # Test with non-float values in dict
    tol_dict = {"colA": "not_a_float", "default": 0.1}
    # Should return the string if column matches, even though it's not a float
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 436ns -> 399ns (9.27% faster)
    # Should return default as float if column doesn't match
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 294ns -> 379ns (22.4% slower)


def test_tol_dict_has_nested_dict():
    # Test with a nested dict as value
    tol_dict = {"colA": {"nested": "dict"}, "default": 0.1}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 436ns -> 411ns (6.08% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 285ns -> 364ns (21.7% slower)


def test_tol_dict_with_int_values():
    # Test with integer values
    tol_dict = {"colA": 1, "default": 2}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 433ns -> 398ns (8.79% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 275ns -> 342ns (19.6% slower)


def test_tol_dict_with_negative_values():
    # Test with negative values
    tol_dict = {"colA": -0.1, "default": -0.2}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 433ns -> 405ns (6.91% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 271ns -> 312ns (13.1% slower)


def test_tol_dict_with_zero_values():
    # Test with zero values
    tol_dict = {"colA": 0.0, "default": 0.0}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 432ns -> 364ns (18.7% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 296ns -> 337ns (12.2% slower)


def test_tol_dict_with_duplicate_keys():
    # A dict literal with a repeated key keeps only the last value; test that lookup returns it
    tol_dict = {"colA": 0.1, "colA": 0.2, "default": 0.3}
    # Only last value for 'colA' remains
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 435ns -> 379ns (14.8% faster)


def test_tol_dict_with_special_characters_in_column():
    # Test with special characters in column name
    tol_dict = {"col$A": 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        "col$A", tol_dict
    )  # 466ns -> 388ns (20.1% faster)
    codeflash_output = get_column_tolerance(
        "col@B", tol_dict
    )  # 317ns -> 401ns (20.9% slower)


def test_tol_dict_with_case_sensitivity():
    # Test case sensitivity
    tol_dict = {"ColA": 0.1, "cola": 0.2, "default": 0.3}
    codeflash_output = get_column_tolerance(
        "ColA", tol_dict
    )  # 426ns -> 368ns (15.8% faster)
    codeflash_output = get_column_tolerance(
        "cola", tol_dict
    )  # 260ns -> 232ns (12.1% faster)
    codeflash_output = get_column_tolerance(
        "COLA", tol_dict
    )  # 217ns -> 290ns (25.2% slower)


def test_tol_dict_with_spaces_in_column():
    # Test column names with spaces
    tol_dict = {"col A": 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        "col A", tol_dict
    )  # 418ns -> 344ns (21.5% faster)
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 301ns -> 342ns (12.0% slower)


def test_tol_dict_with_numeric_column_names():
    # Test numeric column names
    tol_dict = {123: 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        123, tol_dict
    )  # 502ns -> 484ns (3.72% faster)
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 320ns -> 391ns (18.2% slower)


# 3. Large Scale Test Cases


def test_large_dict_all_explicit():
    # Test with a large dictionary where all columns are explicitly listed
    tol_dict = {f"col{i}": float(i) for i in range(1000)}
    for i in range(1000):
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 212μs -> 185μs (14.4% faster)
    # Test a column not present
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 231ns -> 310ns (25.5% slower)


def test_large_dict_with_default():
    # Test with a large dictionary and a default value
    tol_dict = {f"col{i}": float(i) for i in range(999)}
    tol_dict["default"] = 3.14159
    for i in range(999):
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 214μs -> 185μs (15.8% faster)
    # Test a column not present
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 248ns -> 391ns (36.6% slower)


def test_large_dict_all_default():
    # Test with a large dictionary where only default is present
    tol_dict = {"default": 2.71828}
    for i in range(1000):
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 196μs -> 175μs (11.8% faster)


def test_large_dict_no_default():
    # Test with a large dictionary and no default value
    tol_dict = {f"col{i}": float(i) for i in range(1000)}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 582ns -> 570ns (2.11% faster)


def test_large_dict_with_mixed_types():
    # Test with a large dictionary with mixed value types
    tol_dict = {f"col{i}": i if i % 2 == 0 else float(i) for i in range(500)}
    tol_dict["default"] = "mixed"
    for i in range(500):
        expected = i if i % 2 == 0 else float(i)
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 109μs -> 93.3μs (16.9% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 234ns -> 371ns (36.9% slower)


def test_large_dict_with_long_column_names():
    # Test with long column names
    tol_dict = {f"col{'x' * 50}{i}": float(i) for i in range(1000)}
    for i in range(1000):
        codeflash_output = get_column_tolerance(
            f"col{'x' * 50}{i}", tol_dict
        )  # 241μs -> 212μs (13.6% faster)
    codeflash_output = get_column_tolerance(
        "colY", tol_dict
    )  # 247ns -> 314ns (21.3% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
from datacompy.base import get_column_tolerance

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------


def test_basic_column_present():
    # Test when the column is present in the dictionary
    tol_dict = {"col1": 0.01, "col2": 0.02}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 620ns -> 567ns (9.35% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 267ns -> 217ns (23.0% faster)


def test_basic_column_not_present_with_default():
    # Test when the column is not present but "default" is
    tol_dict = {"col1": 0.01, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 490ns -> 536ns (8.58% slower)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 278ns -> 312ns (10.9% slower)


def test_basic_column_not_present_no_default():
    # Test when the column is not present and no "default" exists
    tol_dict = {"col1": 0.01}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 450ns -> 426ns (5.63% faster)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 260ns -> 289ns (10.0% slower)


def test_basic_default_column_explicit():
    # Test when "default" is explicitly queried
    tol_dict = {"col1": 0.01, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "default", tol_dict
    )  # 462ns -> 453ns (1.99% faster)


def test_basic_column_present_overrides_default():
    # Test that column-specific tolerance overrides "default"
    tol_dict = {"col1": 0.01, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 445ns -> 425ns (4.71% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------


def test_edge_empty_dict():
    # Test with an empty dictionary
    tol_dict = {}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 497ns -> 402ns (23.6% faster)
    codeflash_output = get_column_tolerance(
        "default", tol_dict
    )  # 267ns -> 228ns (17.1% faster)


def test_edge_default_is_zero():
    # Test when "default" is explicitly zero
    tol_dict = {"default": 0.0}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 444ns -> 513ns (13.5% slower)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 288ns -> 283ns (1.77% faster)


def test_edge_column_name_is_empty_string():
    # Test when column name is an empty string
    tol_dict = {"": 0.15, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "", tol_dict
    )  # 471ns -> 420ns (12.1% faster)
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 291ns -> 404ns (28.0% slower)


def test_edge_column_name_is_none():
    # Test when column name is None
    tol_dict = {None: 0.2, "default": 0.05}
    codeflash_output = get_column_tolerance(
        None, tol_dict
    )  # 510ns -> 442ns (15.4% faster)
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 295ns -> 450ns (34.4% slower)


def test_edge_tol_dict_has_non_float_values():
    # Test when tol_dict contains non-float values
    tol_dict = {"col1": "0.1", "default": 0.05}
    # Should return the value as stored, even if not float
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 436ns -> 402ns (8.46% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 278ns -> 392ns (29.1% slower)


def test_edge_tol_dict_has_int_values():
    # Test when tol_dict contains integer values
    tol_dict = {"col1": 2, "default": 5}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 405ns -> 386ns (4.92% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 274ns -> 371ns (26.1% slower)


def test_edge_tol_dict_has_negative_values():
    # Test when tol_dict contains negative values
    tol_dict = {"col1": -0.01, "default": -0.05}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 407ns -> 389ns (4.63% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 301ns -> 385ns (21.8% slower)


def test_edge_tol_dict_has_nan_and_inf():
    # Test when tol_dict contains float('nan') and float('inf')
    tol_dict = {"col1": float("nan"), "col2": float("inf"), "default": 0.1}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 453ns -> 369ns (22.8% faster)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 318ns -> 387ns (17.8% slower)


def test_edge_tol_dict_has_multiple_defaults():
    # Test that only one "default" key is respected
    tol_dict = {"col1": 0.01, "default": 0.05, "Default": 0.1}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 410ns -> 470ns (12.8% slower)


def test_edge_column_name_is_integer():
    # Test when column name is an integer
    tol_dict = {1: 0.5, "default": 0.05}
    codeflash_output = get_column_tolerance(
        1, tol_dict
    )  # 494ns -> 465ns (6.24% faster)
    codeflash_output = get_column_tolerance(
        2, tol_dict
    )  # 276ns -> 359ns (23.1% slower)


def test_edge_tol_dict_has_extra_keys():
    # Test when tol_dict has extra unrelated keys
    tol_dict = {"col1": 0.01, "col2": 0.02, "default": 0.05, "extra": 999}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 422ns -> 417ns (1.20% faster)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 303ns -> 400ns (24.2% slower)


# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_large_scale_many_columns():
    # Test with a dictionary containing 1000 columns
    tol_dict = {f"col{i}": i * 0.001 for i in range(1000)}
    tol_dict["default"] = -1.0
    # Test some random columns
    codeflash_output = get_column_tolerance(
        "col0", tol_dict
    )  # 544ns -> 529ns (2.84% faster)
    codeflash_output = get_column_tolerance(
        "col500", tol_dict
    )  # 288ns -> 287ns (0.348% faster)
    codeflash_output = get_column_tolerance(
        "col999", tol_dict
    )  # 204ns -> 345ns (40.9% slower)
    # Test a column not present
    codeflash_output = get_column_tolerance(
        "col1001", tol_dict
    )  # 232ns -> 299ns (22.4% slower)


def test_large_scale_all_default():
    # Test with a dictionary of 1000 unrelated keys and only "default"
    tol_dict = {f"foo{i}": i for i in range(1000)}
    tol_dict["default"] = 42.42
    # Should always return default for unknown column
    codeflash_output = get_column_tolerance(
        "not_in_dict", tol_dict
    )  # 479ns -> 508ns (5.71% slower)


def test_large_scale_no_default():
    # Test with a dictionary of 1000 unrelated keys and no "default"
    tol_dict = {f"foo{i}": i for i in range(1000)}
    # Should always return 0.0 for unknown column
    codeflash_output = get_column_tolerance(
        "not_in_dict", tol_dict
    )  # 456ns -> 416ns (9.62% faster)


def test_large_scale_column_is_none():
    # Test with a large dictionary and column name None
    tol_dict = {f"col{i}": i * 0.1 for i in range(1000)}
    tol_dict[None] = 123.456
    tol_dict["default"] = 0.0
    codeflash_output = get_column_tolerance(
        None, tol_dict
    )  # 580ns -> 473ns (22.6% faster)


def test_large_scale_column_is_int():
    # Test with integer column keys in large dict
    tol_dict = {i: float(i) for i in range(1000)}
    tol_dict["default"] = -999.0
    codeflash_output = get_column_tolerance(
        999, tol_dict
    )  # 597ns -> 601ns (0.666% slower)
    codeflash_output = get_column_tolerance(
        1001, tol_dict
    )  # 294ns -> 450ns (34.7% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from datacompy.base import get_column_tolerance


def test_get_column_tolerance():
    get_column_tolerance("", {})
⏪ Replay Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_pytest_teststest_snowflake_py_teststest_polars_py_teststest_sparktest_sql_spark_py_teststest_fuguete__replay_test_0.py::test_datacompy_base_get_column_tolerance | 78.5μs | 73.8μs | 6.40%✅ |
| test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_base_get_column_tolerance | 75.5μs | 70.3μs | 7.39%✅ |
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_8h8xtkx8/tmpwed2y05m/test_concolic_coverage.py::test_get_column_tolerance | 591ns | 565ns | 4.60%✅ |

To edit these changes, run git checkout codeflash/optimize-get_column_tolerance-mi5v2ai5, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 10:30
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Nov 19, 2025